This interactive report demonstrates several practical data science analyses applied to a scientific publications dataset. You will see a sequence of visualizations that together explore: how topics evolve over time, textual stylistic differences across prolific authors, the relationship between reported replicability stamps and impact (citations & downloads), a simple award prediction model evaluation, and a duplicate-title detection heatmap. Each figure is accompanied by an accessible, educational explanation of the algorithm used, why it was chosen, how to read the plot, and what an important finding would look like.
The goal is to provide an approachable, reproducible dashboard that both explains the underlying methods and surfaces actionable insights. Use the controls above each plot to switch views (for example, absolute vs relative topic prevalence or raw vs log-scale comparisons) and hover to reveal details.
This figure uses TF-IDF to convert each paper's title, abstract, and keywords into a vector representation and then applies truncated SVD for compactness before clustering with KMeans to form topics. KMeans is used because it is fast, deterministic (with fixed random seed), and yields easily interpretable clusters for exploration. This pipeline (TF-IDF → SVD → KMeans) is a common lightweight approach to topic discovery when you need quick, scalable groupings without heavy language model compute.
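The TF-IDF → SVD → KMeans pipeline described above can be sketched with scikit-learn. This is a minimal illustration, not the report's actual code: the toy documents, the number of SVD components, and the cluster count are all placeholder assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# toy stand-ins for concatenated title + abstract + keywords text
docs = [
    "graph drawing layout algorithms",
    "volume rendering transfer functions",
    "graph layout force directed",
    "interactive volume visualization rendering",
]

# TF-IDF vectorization -> truncated SVD for compactness -> KMeans topics;
# fixed random_state keeps the clustering deterministic across runs
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=42),
    KMeans(n_clusters=2, random_state=42, n_init=10),
)
labels = pipeline.fit_predict(docs)
```

With real data, the topic keywords shown above each band come from inspecting the highest-weighted TF-IDF terms per cluster.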
The stacked area plot shows the relative prevalence of each discovered topic per year (you can toggle to absolute counts).
- X axis: year. Y axis: fraction of papers in each topic (stacked to 1). Each color is a topic; the linked keywords above the plot summarize what terms drive that topic.
- Interpretation: a rising colored band means that topic is becoming more common over time; a shrinking band means declining interest.
- Impactful findings: sustained growth of a topic's relative share across several years suggests an emerging research area, while sudden spikes may indicate one-time workshops or trend effects.
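The per-year prevalence behind the stacked area can be computed with a simple cross-tabulation and row normalization. The tiny DataFrame here is a hypothetical stand-in for the real per-paper topic assignments:

```python
import pandas as pd

# hypothetical per-paper records: publication year and assigned topic label
df = pd.DataFrame({
    "year":  [2019, 2019, 2020, 2020, 2020, 2021, 2021, 2021],
    "topic": ["A", "B", "A", "A", "B", "A", "A", "A"],
})

# absolute counts per (year, topic), then row-normalize so each year sums to 1
counts = pd.crosstab(df["year"], df["topic"])
shares = counts.div(counts.sum(axis=1), axis=0)
```

`counts` feeds the absolute view of the plot and `shares` the relative (stacked-to-1) view.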
Main takeaway: If a topic shows a consistent upward trend across years, it signals an emerging or rapidly growing research direction; a declining band indicates waning interest, while a flat band indicates stable interest.
From here, researchers could drill down by filtering papers within a topic to inspect representative abstracts or use more advanced dynamic topic models (e.g., BERTopic) to capture topic drift more precisely.
This comparison extracts simple stylometric features from paper abstracts (average words per sentence, average word length, abstract length, title length). Violins visualize the distribution of each metric for the most prolific authors. We select these features because they are intuitive, computationally cheap, and effective at capturing broad writing-style differences. Violin plots display the full distributional shape rather than just summary statistics.
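A minimal sketch of how such stylometric features might be computed; the function name, the sentence-splitting regex, and the word tokenization are illustrative assumptions, not the report's exact implementation:

```python
import re

def stylometric_features(abstract: str, title: str) -> dict:
    """Compute the four simple stylometric features described above."""
    # naive sentence split on terminal punctuation; drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+", abstract) if s.strip()]
    # naive word tokenization: alphabetic runs (apostrophes allowed)
    words = re.findall(r"[A-Za-z']+", abstract)
    return {
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "abstract_length": len(words),          # in words
        "title_length": len(title.split()),     # in words
    }

feats = stylometric_features("We present a method. It works well.", "A Method")
```

In practice these per-paper dictionaries would be collected into a DataFrame and grouped by author to draw the violins.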
How to read: each violin represents the distribution of the chosen metric for an author: wide regions indicate many papers with that value, narrow regions fewer. The inner box shows the interquartile range and the mean line indicates the average. This helps detect authors with consistently shorter abstracts, longer sentences, or unusual word-length patterns. Significant findings would include an author exhibiting consistently shorter abstracts or a markedly different average words-per-sentence compared to peers, which might reflect a distinct writing style or editorial constraints.
Main takeaway: Consistent differences in stylometric features across authors can reveal distinct writing conventions or editorial norms; large deviations may warrant further investigation (e.g., author-specific practices or potential copy-paste patterns).
Next steps might include training a stylometric classifier to attribute anonymous text to likely authors or to highlight anomalous submissions for manual review.
This section compares papers with and without a reported graphics replicability stamp across two impact measures: CrossRef citations and IEEE Xplore downloads. Because raw citation and download counts are often heavy-tailed, the visualization offers both raw and log1p views and a jittered point view to inspect individual observations. We use boxplots for distributional comparisons (median, IQR, whiskers) and jittered points to reveal outliers.
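The motivation for the log1p view and for median-based comparison can be shown on hypothetical heavy-tailed counts (the numbers below are invented for illustration):

```python
import numpy as np

# hypothetical citation counts for the two groups; both are heavy-tailed
stamped     = np.array([3, 8, 15, 40, 120])
not_stamped = np.array([1, 2, 5, 9, 300])

# log1p compresses the heavy tail while keeping zero counts well-defined
log_stamped = np.log1p(stamped)

# compare medians, which the single outlier (300) cannot dominate,
# unlike the mean of the not-stamped group
print(np.median(stamped), np.median(not_stamped))  # -> 15.0 5.0
```

Here the not-stamped group has the higher mean (pulled up by one outlier) but the lower median, which is exactly the distinction the boxplot medians and jittered points make visible.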
How to read: each panel shows the distribution for stamped vs not-stamped papers.
- If the 'Stamped' group has a visibly higher median box and/or a generally higher distribution in both citations and downloads, that suggests an association between reported replicability practices and impact.
- However, causality is not established here: confounding factors (conference, year, topic) can influence both replicability reporting and impact.
Important signals: consistently higher medians and shifted distributions for stamped papers across both metrics would be noteworthy.
Main takeaway: If stamped papers systematically show higher citations and downloads, this suggests replicable research may correlate with higher scholarly impact, but a follow-up causal analysis is needed before drawing conclusions.
Next steps could include causal inference (propensity score matching or difference-in-differences) controlling for confounders like conference and year to estimate the replicability stamp's effect on impact.
We build a simple, interpretable baseline model to predict whether a paper received an award using numeric metadata (e.g., page count, author count, citations, downloads) and TF-IDF features from the title. Logistic regression with balanced class weights is chosen because it provides well-calibrated probabilities and coefficients that are easy to interpret. Given the rarity of awards, we emphasize cross-validated evaluation (ROC and Precision-Recall) to account for class imbalance.
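The modeling setup described here (logistic regression with balanced class weights, evaluated with cross-validated ROC-AUC and PR-AUC) can be sketched as follows. The synthetic dataset is a placeholder for the real paper metadata, with roughly 5% positives to mimic the rarity of awards:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for numeric metadata + title features; ~5% positive class
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.95], random_state=0)

# balanced class weights up-weight the rare positive (award) class
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# cross-validated ROC-AUC and PR-AUC (average precision) as in the figure
roc_auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
pr_auc = cross_val_score(clf, X, y, cv=5, scoring="average_precision").mean()
```

Under class imbalance, PR-AUC is the more demanding score: a random classifier's PR-AUC equals the positive rate (about 0.05 here), whereas its ROC-AUC is still 0.5.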
How to read: the left panel shows the ROC curve (the trade-off between true positive and false positive rates) and its AUC. The middle panel is the Precision-Recall curve, which is more informative under class imbalance; a higher area indicates better precision at high recall values. The right panel shows the most influential numeric features (by coefficient magnitude), where the sign indicates whether a higher value increases or decreases award probability. Impactful findings would include a model with a PR-AUC well above the random baseline and interpretable features consistent with domain knowledge (for example, higher early downloads predicting awards).
Main takeaway: Interpretable models can surface which observable features are associated with awards; however, due to label sparsity and temporal changes, this is a hypothesis-generating step rather than conclusive prediction.
Recommended next steps: add richer text embeddings, perform temporal holdout validation, and use SHAP to explain predictions at the paper level.